Week 1: Exploratory Data Analysis and Feature Engineering¶
Xiangfan Song s2883206, Yu Lin s2803592, Wenhao Xie s2801571
Aims¶
By the end of this notebook you will
- understand and play with the different aspects of data pre-processing
- be familiar with tools for exploratory data analysis and visualization
- understand the basics of feature engineering
- build your first pipeline
Topics and Instructions¶
In lecture this week, we reviewed the general machine learning pipeline, which, following the "Machine Learning Project Checklist" of Géron (2019), can be structured as:
- Frame the problem and look at the big picture.
- Get the data.
- Explore the data and gain insights.
- Prepare the data to better expose the underlying data patterns to Machine Learning algorithms.
- Explore many different models and shortlist the best ones.
- Fine-tune your models and combine them into a great solution.
- Present your solution.
- Launch, monitor, and maintain your system.
In this week's workshop, we will focus on the initial steps of this pipeline, that is, on data pre-processing, exploratory data analysis, and feature engineering.
During workshops, you will complete the worksheets together in teams of 2-3, using pair programming. During the first few weeks, the worksheets will contain cues to switch roles between driver and navigator. When completing worksheets:
- You will have tasks tagged by (CORE) and (EXTRA).
- Your primary aim is to complete the (CORE) components during the WS session; afterwards, you can try to complete the (EXTRA) tasks as part of your self-study.
- Look for the 🏁 as a cue to switch roles between driver and navigator.
- In some exercises, you will find helpful hints at the bottom of the questions.
Instructions for submitting your workshops can be found at the end of the worksheet. As a reminder, you must submit a PDF of your notebook on Learn by 16:00 on the Friday of the week the workshop was given.
Problem Definition and Setup ¶
Packages¶
Now let's load some packages to get us started. The following are widely used libraries for working with Python in general.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
If you need a package that is not already installed, you must install it before importing it. For instance, feature-engine is a Python library for feature engineering and selection, which:
contains multiple transformers to engineer and select features to use in machine learning models.
preserves scikit-learn functionality with methods fit() and transform() to learn parameters from and then transform the data (we will learn more about these throughout the course!).
# To install the feature-engine library (if not already installed)
!pip install feature-engine
Requirement already satisfied: feature-engine in f:\anaconda\envs\mlp\lib\site-packages (1.9.3)
In some cases, we may need only a component of the whole library. If this is the case, it is possible to import specific things from a module (library), using the following line of code:
from feature_engine.imputation import EndTailImputer
Problem¶
Now it is time to move on to the next step.
You are asked to build a model of housing prices in California using the California census data. This data has metrics such as the population, median income, median housing price, and so on for each block group in California. Block groups are the smallest geographical unit for which the US Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people). We will just call them “districts” for short.
Your model should learn from this data and be able to predict the median housing price in any district, given all the other metrics.
The first question to ask your boss is what exactly is the business objective; building a model is probably not the end goal. How does the company expect to use and benefit from this model? This is important because it will determine how you frame the problem, what algorithms you will select, what performance measure you will use to evaluate your model, and how much effort you should spend tweaking it.
The next question to ask is what the current solution looks like (if any). It will often give you a reference performance, as well as insights on how to solve the problem. Your boss answers that the district housing prices are currently estimated manually by experts: a team gathers up-to-date information about a district, and when they cannot get the median housing price, they estimate it using complex rules.
This is costly and time-consuming, and their estimates are not great; in cases where they manage to find out the actual median housing price, they often realize that their estimates were off by more than 20%. This is why the company thinks that it would be useful to train a model to predict a district’s median housing price given other data about that district. The census data looks like a great dataset to exploit for this purpose, since it includes the median housing prices of thousands of districts, as well as other data.
🚩 Exercise 1 (CORE)¶
Using the information above answer the following questions about how you may design your machine learning system.
a) Is this a supervised or unsupervised learning task?
Type your answer here!
b) Is this a classification, regression, or some other task?
Type your answer here!
c) Suppose you are only required to predict if a district's median housing prices are "cheap," "medium," or "expensive". Will this be the same or a different task?
Type your answer here!
Data Download¶
The data we will be using this week is a modified version of the California Housing dataset. We can get the data in a number of ways; the easiest is to load it from the working directory (where we have already downloaded it).
housing = pd.read_csv("housing.csv")
housing.head()
| longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
Exploratory Data Analysis ¶
In this section we are going to start with exploring the California Housing data using methods that you will likely already be familiar with.
Data can come in a broad range of forms encompassing a collection of discrete objects, numbers, words, events, facts, measurements, observations, or even descriptions of things. Processing data using exploratory data analysis (EDA) can elicit useful information and knowledge by examining the available dataset to discover patterns, spot anomalies, test hypotheses, and check assumptions.
Let's start by examining the Data Dictionary and the variables available:
longitude: A measure of how far west a house is; a higher value is farther west
latitude: A measure of how far north a house is; a higher value is farther north
housingMedianAge: Median age of a house within a block; a lower number is a newer building
totalRooms: Total number of rooms within a block
totalBedrooms: Total number of bedrooms within a block
population: Total number of people residing within a block
households: Total number of households, a group of people residing within a home unit, for a block
medianIncome: Median income for households within a block of houses (measured in tens of thousands of US Dollars)
medianHouseValue: Median house value for households within a block (measured in US Dollars)
oceanProximity: Location of the house w.r.t ocean/sea
# Code for your answer here!
housing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
Type your answer here!
b) From the information provided above, can you anticipate any data cleaning we may need to do?
total_bedrooms contains missing values (20433 non-null of 20640), and the categorical ocean_proximity will need to be encoded (e.g. with one-hot encoding).
🚩 Exercise 3 (CORE)¶
a) Use descriptive statistics and histograms to examine the distributions of the numerical attributes.
Hint
- `.describe()` can be used to create summary descriptive statistics on a pandas dataframe.
- You can use `sns.histplot` to create histograms.
# Code for your answer here!
display(housing.describe())
housing.hist(bins=50, figsize=(10, 10))  # 50 bins per feature keeps the plots readable
plt.tight_layout()
plt.show()
b) Can you identify other pre-processing/feature engineering steps we may need to do? Which variables represent counts and how are they distributed?
1. Missing values in total_bedrooms need imputing, ocean_proximity needs encoding (e.g. one-hot), and the very skewed count variables could be transformed with np.log1p. 2. All of the count variables (total_rooms, total_bedrooms, population, households) are right-skewed.
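The log1p idea mentioned above can be sketched on a made-up count series: `np.log1p` computes log(1 + x), which compresses large right-skewed counts while remaining defined at zero.

```python
import numpy as np
import pandas as pd

# Hypothetical, strongly right-skewed counts (values are made up)
counts = pd.Series([100, 500, 1500, 8000, 39000], name="total_rooms")

logged = np.log1p(counts)  # log(1 + x), safe even when a count is 0
print(logged.round(2))     # the 390x spread shrinks to roughly a 2x spread
```

After such a transform, the distribution of a count feature is usually much closer to symmetric, which many models handle better.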
c) One thing you may have noticed from the histogram is that the median income, housing median age, and the median house value are capped. The median house value capping (this being our target value) may or may not be a problem depending on your client. If we needed precise predictions beyond $\$500,000$, we may need to either collect proper labels/outputs for the districts whose labels were capped or remove these districts from the data. Following the latter, remove all districts whose median house value is capped. How many observations are there now?
# Code for your answer here!
# Remove the cases where median_house_value >= 500,000$
cap = housing["median_house_value"].max()
housing_no_cap = housing[housing["median_house_value"]<cap]
housing_no_cap.shape[0]
19675
🚩 Exercise 4 (CORE)¶
What are the possible categories for the ocean_proximity variable? Are the number of instances in each category similar?
Hint
- `value_counts()` can be used to count the values of the categories of a pandas series.
- You can use `sns.countplot` to create a barplot with the number of instances of each category.
housing["ocean_proximity"].unique()                      # all categories
housing["ocean_proximity"].value_counts()                # count per category
housing["ocean_proximity"].value_counts(normalize=True)  # proportion per category
ocean_proximity
<1H OCEAN     0.442636
INLAND        0.317393
NEAR OCEAN    0.128779
NEAR BAY      0.110950
ISLAND        0.000242
Name: proportion, dtype: float64
Type your answer here!
🏁 Now is a good point to switch driver and navigator
🚩 Exercise 5 (CORE)¶
Examine if/which of the features are correlated to each other. Are any of the features correlated with our output (median_house_value) variable?
Can you think of any reason why certain features may be correlated?
How might we use this information in later steps of our model pipeline?
Hint
- `.corr()` can be used to compute the correlations.
- You can use `sns.heatmap` to visualize the correlations.
# Code for your answer here!
# correlation matrix (numeric only)
corr = housing.corr(numeric_only=True)
# heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.tight_layout()
plt.show()
median_income has a strong positive correlation with median_house_value: people who earn more tend to buy more expensive houses. In later steps of the pipeline, we can use the correlations between features to reduce multicollinearity (e.g. by dropping or combining highly correlated features).
🚩 Exercise 6 (CORE)¶
Use sns.pairplot to further investigate the joint relationship between each pair of variables. What insights into the data might this provide over looking only at the correlation?
# Code for your answer here!
sns.pairplot(housing, diag_kind="hist", plot_kws={"alpha": 0.3, "s": 15})
plt.tight_layout()
plt.show()
Type your answer here!
Data Pre-Processing ¶
Now that we have some familiarity with the data through EDA, let's start preparing our data to be modelled.
Data Cleaning¶
Let's start with some basic data cleaning steps. For example, we may want to:
- deal with duplicates, inconsistencies, or typos in the data,
- handle missing data,
- remove uninformative features (e.g. subject identifiers),
- fix variable types,
- adjust data codes (e.g. missing values may be coded as '999' instead of NA),
- optionally remove outliers.
Let's start with the first of these.
Data Duplication and Errors¶
We want to remove duplicates that may have accidentally been entered into the database twice, as they may bias our fitted model; in other words, we may overfit to this subset of points. However, care should be taken to check that they are not real data points that happen to have identical values.
There are a number of ways we could identify duplicates; the simplest (and the approach we'll focus on) is to find observations with identical feature values. Of course, this will not identify issues such as spelling errors, missing values, address changes, use of aliases, etc. These commonly occur with categorical or text data, so checking the unique values is recommended. In general, such errors may require more sophisticated methods along with manual assessment.
🚩 Exercise 7 (CORE)¶
a) Are there any duplicated values in the data? If so how many?
Hint
With Pandas dataframes you can use `.duplicated()` to get a boolean of whether something is a duplicate and then use `.sum()` to count how many there are.

b) What are the unique values of the categorical variable? Are there any duplicated categories arising from misspellings?
# Code for your answer here!
housing.duplicated().sum()
housing["ocean_proximity"].unique()
housing["ocean_proximity"].value_counts(dropna=False)
ocean_proximity
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: count, dtype: int64
Type your answer here!
Outlier Detection¶
An outlier is a data point that lies abnormally far from the other observations and may distort the model fit and results. In general, it is a good idea to check whether any outliers are present during preprocessing. In some cases, you may want to drop these observations or cap their values (see https://feature-engine.trainindata.com/en/1.8.x/api_doc/outliers/index.html). However, this may not be appropriate without explicit knowledge of, and testing for, whether they really are outliers. In particular, when you drop or cap those observations you may unwittingly discard important information!
We will use basic statistics in order to try to identify outliers. A simple method of detecting outliers is to use the inter-quartile range (IQR) proximity rule (Tukey fences) which states that a value is an outlier if it falls outside these boundaries:
Upper boundary = 75th quantile + (IQR * $k$)
Lower boundary = 25th quantile - (IQR * $k$)
where IQR = 75th quantile - 25th quantile (the length of the box in a boxplot). This rule defines the whiskers in the boxplot, where $k$ is a nonnegative constant typically set to 1.5 (the default value in sns.boxplot). However, it is also common practice to flag extreme values by setting $k$ to 3.
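The boundaries above can be computed directly. A minimal sketch on made-up values (standing in for a feature such as median income, in $10k):

```python
import pandas as pd

# Hypothetical values with one obvious extreme point
s = pd.Series([1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 12.0])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1                     # the length of the box in the boxplot
k = 1.5                           # whisker length; try k = 3 for extreme values
lower, upper = q1 - k * iqr, q3 + k * iqr

mask = (s < lower) | (s > upper)  # True for points outside the Tukey fences
print(f"fences: [{lower}, {upper}]")
print(s[mask])                    # the flagged outlier(s)
```

With $k = 1.5$ only the value 12.0 falls outside the fences here; increasing $k$ widens the fences and flags fewer points.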
🚩 Exercise 8 (EXTRA)¶
a) Can you identify any potential outliers using the generated boxplots below? Do you think any points should be removed?
b) Try changing $k$, defining the length of the whiskers, to 3 in sns.boxplot. Can you still identify any potential outliers?
fig, axes = plt.subplots(figsize = (15,10), ncols = (housing.shape[1]-1)//3, nrows = 3, sharex = True)
axes = axes.flatten()
# number of numeric features = housing.shape[1] - 1 (excluding ocean_proximity);
# this sets how many columns of subplots each of the 3 rows gets
for i, ax in enumerate(axes):
sns.boxplot(y = housing.iloc[:,i], ax = ax, whis=3)
ax.set_title(housing.iloc[:,i].name)
ax.set_ylabel("")
plt.suptitle("Boxplots")
plt.tight_layout()
plt.show()
Type your answer here!
🏁 Now is a good point to switch driver and navigator
Missing Data¶
Most ML models cannot handle missing values, and as we saw earlier, some are present in total_bedrooms. We also saw that values of median_house_value are capped at $\$500,000$. This is another form of missingness, and an informative one (i.e. we know the missing values are greater than $\$500,000$). However, we will focus on methods for dealing with missingness in our features rather than in the target variable.
As such, let's start by splitting our features from our target variable in the data set.
# Extracting the features from the data
X = housing.drop("median_house_value", axis = 1)
features = list(X.columns)
print(features)
print(X.shape)
display(X.head())
['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income', 'ocean_proximity'] (20640, 9)
| longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | NEAR BAY |
# Extracting the target features from the data
y = housing["median_house_value"].copy()
print(y.shape)
display(y.head())
(20640,)
0 452600.0 1 358500.0 2 352100.0 3 341300.0 4 342200.0 Name: median_house_value, dtype: float64
There are a number of ways we can deal with missing values. The simplest is to just remove NA values. We can do this in two ways by either:
- Getting rid of the corresponding observations (deleting the corresponding rows).
- Getting rid of the whole attribute (deleting the corresponding columns).
This is relatively straightforward: run housing.dropna() with axis set to 0 or 1 (depending on whether we want to remove rows or columns) before splitting our data into features (X) and outputs (y).
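The two axis options can be contrasted on a tiny made-up frame (not the housing data), so the effect of each is clear before you apply it:

```python
import numpy as np
import pandas as pd

# Toy frame (made up): one missing value in column "a"
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

print(df.dropna(axis=0).shape)  # drop rows containing any NaN    -> (2, 2)
print(df.dropna(axis=1).shape)  # drop columns containing any NaN -> (3, 1)
```

Dropping rows loses observations; dropping columns loses an entire feature. Which is less damaging depends on how much missingness each contains.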
🚩 Exercise 9 (CORE)¶
Use dropna() to remove the missing observations. What is the shape of the feature matrix after dropping the missing observations?
Notes
- It may be tempting to overwrite `X` while working on our pre-processing steps. Don't do this! We will run these objects through our pipeline, which combines missing data steps with other steps later, so if you want to test your function make sure to assign the output to temporary objects (e.g. `X_`).
# Code for your answer here!
# Do not overwrite X
X_ = X.dropna()
print(X_.shape)
(20433, 9)
Instead of simply dropping missing data, we may want to use other imputation methods. From here on, we will be creating functions for our data transformations. Later, we will see why this is really useful for defining our model pipeline, which allows us to chain together transformations and steps in a reproducible way.
In this course we are mostly going to be using Scikit-learn, with a little Keras at the end for neural networks. Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning (https://scikit-learn.org/stable/getting_started.html). It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.
We will first focus on the transformer class within Scikit-learn, which provides functions for missing data imputation along with many others useful for data pre-processing and feature engineering.
Transformers¶
If we want to alter the features of our data, we need a transformer.
Transformers are classes that follow the scikit-learn API and are used to clean, impute, reduce, expand, or generate feature representations.
Transformers are classes with a `.fit()` method, which learns model parameters (e.g. the mean for mean imputation) from a training set, and a `.transform()` method, which applies this transformation to data. To create a custom transformer, all you need is to create a class that implements three methods: `fit()`, `transform()`, and `fit_transform()`.
Therefore, to transform a dataset, each transformer implements:
obj.fit(data)
data_transformed = obj.transform(data)
or simply...
data_transformed = obj.fit_transform(data)
See more details: https://scikit-learn.org/stable/data_transforms.html. In the following subsections, we will see examples of transformers for categorical and numerical variables.
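To make the fit/transform pattern concrete, here is a minimal custom transformer (a hypothetical mean-centering step, not part of this worksheet's pipeline). Inheriting from `TransformerMixin` provides `fit_transform()` for free once `fit()` and `transform()` are defined:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: subtracts the column means learned in fit()."""

    def fit(self, X, y=None):
        # learn the parameters (here, the per-column mean) from the training data
        self.means_ = np.asarray(X).mean(axis=0)
        return self

    def transform(self, X):
        # apply the learned parameters to (possibly new) data
        return np.asarray(X) - self.means_

X_demo = np.array([[1.0, 10.0], [3.0, 30.0]])
centered = MeanCenterer().fit_transform(X_demo)
print(centered)  # each column now has mean 0
```

Because the parameters are learned in `fit()` and stored on the object (with the trailing-underscore convention, `means_`), the same transformation can later be applied to validation or test data without re-learning.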
Data Imputation¶
Instead of removing the missing data, we can set it to some value. To do this, Scikit-Learn provides various transformers, including:
- `SimpleImputer`, which provides simple strategies (e.g. "mean" and "median" for numerical features and "most_frequent" for categorical features).
- You can also add a missing indicator with the option `add_indicator=True` in `SimpleImputer`, or use the transformer `MissingIndicator`. This may be useful when missingness itself provides information for predicting the target (e.g. obese patients may prefer not to report BMI; thus, this missingness could be useful for estimating the risk of health conditions or diseases).
- Beyond simple imputation strategies, sklearn also provides more advanced strategies in `IterativeImputer` and `KNNImputer`.
- Other strategies are available in `feature_engine.imputation`, such as `EndTailImputer`, which is useful when missing values are located in the tails (e.g. values capped for privacy).
Let's start with the SimpleImputer to learn about transformers and how to deal with missing data in sklearn.
from sklearn.impute import SimpleImputer
# First create the imputer object/transformer
num_imputer = SimpleImputer(strategy="median")
# Now fit the object to the data
num_imputer.fit(X)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
...
ValueError: Cannot use median strategy with non-numeric data:
could not convert string to float: 'NEAR BAY'
Unfortunately, when we apply this to our data, we get the following error:
ValueError: Cannot use median strategy with non-numeric data:
could not convert string to float:
This is because the "median" strategy can only be used with numerical attributes, so we need a way of applying imputation only to certain attributes. We could temporarily remove the categorical feature from our data to apply our function, or apply the function to a subset of the data and assign the output back to that subset.
However, scikit-learn has a handy function to specify which columns we want to apply a function to!
from sklearn.compose import ColumnTransformer
# Names of numerical columns
numcols = features[:-1]
print(numcols)
catcols = [features[-1]]
print(catcols)
num_cols_imputer = ColumnTransformer(
    # apply `num_imputer` to all columns apart from the last
    [("num", num_imputer, numcols)],
    # a second entry, e.g. ("cat", cat_imputer, catcols), could impute the categorical column too
    # leave all other columns untouched and concatenate them onto the end of
    # the transformed data
    remainder="passthrough"
)
num_cols_imputer.fit(X)
# fit() learns the parameters; transform() (or fit_transform()) applies them
# Print the median values computed by calling fit
print("Computed median values for each numerical feature:")
print(num_cols_imputer["num"].statistics_)
['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']
['ocean_proximity']
Computed median values for each numerical feature:
[-118.49 34.26 29. 2127. 435. 1166. 409.
3.5348]
After calling .fit, our object has a number of attributes, including statistics_, which stores the median value of each numerical attribute on the training set. These stored values will be used when validating and testing the model if there is missing data in the new data.
Note
- The fitted `ColumnTransformer` contains a list of transformers, stored in the attribute `transformers_`. We named the first transformer in the list `num`. Try running `num_cols_imputer.transformers_` to see the names and types of the transformers in the list.
- To access the fitted `num_imputer`, `num_cols_imputer["num"]` is a shortcut to access the named transformer in the list.
Now, let's call transform on our fitted object to impute the missing values.
X_ = num_cols_imputer.transform(X)
print("Number of Missing Values")
pd.DataFrame(X_, columns = features).isna().sum()
Number of Missing Values
longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
ocean_proximity       0
dtype: int64
🚩 Exercise 10 (CORE)¶
In addition to median imputation, alter your transformer to also include a missing indicator. What is the shape of the transformed feature matrix? Use the method .get_feature_names_out() to print the names of the new features.
Note: You may want to add the option verbose_feature_names_out = False in your ColumnTransformer to reduce the length of the feature names.
# Code for your answer here!
num_imputer = SimpleImputer(strategy="median", add_indicator=True)
ct = ColumnTransformer(
[("num", num_imputer, numcols)],
remainder="drop",
verbose_feature_names_out=False,
)
X_t = ct.fit_transform(X)
print(X_t.shape)
print(ct.get_feature_names_out())
(20640, 9)
['longitude' 'latitude' 'housing_median_age' 'total_rooms' 'total_bedrooms'
 'population' 'households' 'median_income' 'missingindicator_total_bedrooms']
🏁 Now is a good point to switch driver and navigator
Feature Engineering ¶
As discussed in the lectures, feature engineering is where we extract features from data and transform them into formats that are suitable for machine learning models. Today, we will have a look at two main cases that are present in our data: categorical and numerical values.
Feature engineering also requires a transformer class to alter the features.
Categorical Variables¶
In the dataset, we have a text attribute (`ocean_proximity`) that we already had to treat differently when cleaning and visualizing the data. This extends to feature engineering as well, where we need to use different methods than those used with numerical variables.

If we look at the unique values of this attribute, we will see that there are a limited number of possible values, each representing a category. We need a way of encoding this information into our modelling framework by converting our string/categorical variable into a numeric representation that can be included in our models.
If we have a binary categorical variable (two levels), we could simply encode one level as 1 and the other as 0.
However, as we have multiple categories here, we would probably want to use another encoding method. To illustrate, we can try encoding the categorical feature ocean_proximity using both the OrdinalEncoder and the OneHotEncoder available in sklearn.preprocessing.
Side Notes
- The output of the `OneHotEncoder` provided in Scikit-Learn is a SciPy sparse matrix instead of a NumPy array. These are useful when you have lots of categories, as your matrix becomes mostly full of 0's. Storing all these 0's takes up unnecessary memory, so a sparse matrix instead stores only the locations of the nonzero elements. The good news is that you can use a sparse matrix much like a numpy matrix, but if you want to, you can convert it to a dense numpy matrix using `.toarray()`.
- The above does not seem to be the case if passed through a `ColumnTransformer`.
Comment by Sone: a sparse matrix is better than a numpy array when a dataframe has lots of zeros. With the OneHotEncoder, one categorical feature with 4 unique values becomes 4 one-hot columns.

Comment by Sone: ColumnTransformer lets you apply different preprocessing steps to different columns:
- To impute missing values in numerical columns, use a SimpleImputer in a "num" transformer.
- To encode categorical columns, use an encoder (e.g., OneHotEncoder or OrdinalEncoder) in a "cat" transformer.
from sklearn.preprocessing import OrdinalEncoder
# Defining the OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
encoder = ColumnTransformer(
# apply the ordinal_encoder to the last column
[("cat", ordinal_encoder, catcols)],
remainder="passthrough",
verbose_feature_names_out=False)
# fitting the encoder defined above
X_ = encoder.fit_transform(X)
# Accessing the fitted ordinal encoder (encoder["cat"]) to see how the categories were mapped
display(dict(zip(list(encoder["cat"].categories_[0]), range(5))))
# Display the first few rows of the transformed data
display(pd.DataFrame(X_, columns = encoder.get_feature_names_out()).head())
{'<1H OCEAN': 0, 'INLAND': 1, 'ISLAND': 2, 'NEAR BAY': 3, 'NEAR OCEAN': 4}
| | ocean_proximity | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 |
| 1 | 3.0 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 |
| 2 | 3.0 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 |
| 3 | 3.0 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 |
| 4 | 3.0 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 |
from sklearn.preprocessing import OneHotEncoder
# Defining the OneHotEncoder
onehot_encoder = OneHotEncoder()
encoder = ColumnTransformer(
# apply the onehot_encoder to the last column
[("cat", onehot_encoder, catcols)],
remainder="passthrough",
verbose_feature_names_out=False)
X_ = encoder.fit_transform(X)
# Display the first few rows of the transformed data
display(pd.DataFrame(X_, columns = encoder.get_feature_names_out()).head())
| | ocean_proximity_<1H OCEAN | ocean_proximity_INLAND | ocean_proximity_ISLAND | ocean_proximity_NEAR BAY | ocean_proximity_NEAR OCEAN | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 |
| 1 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 |
| 2 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 |
| 3 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 |
| 4 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 |
Comment by Sone:

OrdinalEncoder disadvantages:
- imposes an artificial order on the categories
- implies misleading distances between categories (especially for linear models and k-NN)

OneHotEncoder disadvantages:
- expands the data into many columns
- can overfit when there are many rare categories (a target encoder can help in that case)
🚩 Exercise 11 (CORE)¶
a) What is the main difference between two methods regarding the obtained features? Which encoding method do you think is most appropriate for this variable and why?
b) How sensible is the default ordering of the ordinal encoder? Use the parameter categories of OrdinalEncoder to apply a different ordering.
(a) Main differences:

- One-hot: expands the feature into several binary columns and does not impose any order.
- Ordinal: maps the categories to a single integer column, imposing an order and distances between them.

Since ocean_proximity has no natural ordering, one-hot encoding is the more appropriate choice here.
# (b) Code for your answer here!
# Order the categories roughly from furthest inland to closest to the ocean
order = [['INLAND', '<1H OCEAN', 'NEAR OCEAN', 'NEAR BAY', 'ISLAND']]
ord_enc = OrdinalEncoder(categories=order)
encoder = ColumnTransformer(
[("cat", ord_enc, catcols)],
remainder="passthrough",
verbose_feature_names_out=False
)
X_b = encoder.fit_transform(X)
print(X_b)
[[ 3.0000e+00 -1.2223e+02  3.7880e+01 ...  3.2200e+02  1.2600e+02  8.3252e+00]
 [ 3.0000e+00 -1.2222e+02  3.7860e+01 ...  2.4010e+03  1.1380e+03  8.3014e+00]
 [ 3.0000e+00 -1.2224e+02  3.7850e+01 ...  4.9600e+02  1.7700e+02  7.2574e+00]
 ...
 [ 0.0000e+00 -1.2122e+02  3.9430e+01 ...  1.0070e+03  4.3300e+02  1.7000e+00]
 [ 0.0000e+00 -1.2132e+02  3.9430e+01 ...  7.4100e+02  3.4900e+02  1.8672e+00]
 [ 0.0000e+00 -1.2124e+02  3.9370e+01 ...  1.3870e+03  5.3000e+02  2.3886e+00]]
🚩 Exercise 12 (EXTRA)¶
Another handy feature of OneHotEncoder and OrdinalEncoder is that infrequent categories can be aggregated into a single feature/value. The parameters to enable the gathering of infrequent categories are min_frequency and max_categories.
Use the max_categories parameter to set the maximum number of categories to 4. Use the get_feature_names_out() method of OneHotEncoder to print the new category names. Which two categories have been combined?
# Code for your answer here!
# No ColumnTransformer needed here, since we are only encoding a single column
enc = OneHotEncoder(max_categories=4, handle_unknown="ignore")
enc.fit(housing[["ocean_proximity"]])
print(enc.get_feature_names_out(["ocean_proximity"]))
print(enc.infrequent_categories_)
['ocean_proximity_<1H OCEAN' 'ocean_proximity_INLAND'
 'ocean_proximity_NEAR OCEAN' 'ocean_proximity_infrequent_sklearn']
[array(['ISLAND', 'NEAR BAY'], dtype=object)]
The two least frequent categories, ISLAND and NEAR BAY, have been combined into a single ocean_proximity_infrequent_sklearn feature.
🚩 Exercise 13 (EXTRA)¶
When there are many unordered categories, another useful encoding scheme is TargetEncoder which uses the target mean conditioned on the categorical feature for encoding unordered categories. Whereas one-hot encoding would greatly inflate the feature space if there are a very large number of categories (e.g. zip code or region), TargetEncoder is more parsimonious.
Use target encoding of ocean proximity. What are the numerical values assigned to the categories?
Caution: when using this transformer, be careful to avoid data leakage and overfitting by integrating it properly in your model pipeline! We will learn more about this later.
# Code for your answer here!
# Create a target encoder
from sklearn.preprocessing import TargetEncoder
tgc = TargetEncoder(target_type="continuous")
tgc.fit(X[["ocean_proximity"]],y)
cats = tgc.categories_[0]
vals = tgc.transform(pd.DataFrame({"ocean_proximity": cats})).ravel()
mapping = dict(zip(cats,vals))
mapping
{'<1H OCEAN': 240081.20980262995,
'INLAND': 124810.00113345706,
'ISLAND': 367882.7345151113,
'NEAR BAY': 259186.43556292081,
'NEAR OCEAN': 249415.94571004034}
Numerical Variables¶
Feature Scaling¶
As we will discuss in later weeks, many machine learning algorithms are sensitive to the scale and magnitude of the features, and especially differences in scales across features. For these algorithms, feature scaling will improve performance.
For example, let's investigate the range of the features in our dataset:
fig, ax = plt.subplots(figsize=(15,5))
plt.boxplot(X[numcols], vert = False)
plt.xscale("symlog")
plt.ylabel("Feature")
plt.xlabel("Feature Range")
ax.set_yticklabels(numcols)
plt.suptitle("Feature Range for the Training Set")
plt.tight_layout()
plt.show()
fig, axes = plt.subplots(figsize = (15,10), ncols = (housing.shape[1]-1)//3, nrows = 3, sharex = True)
axes = axes.flatten()
# subplot grid: ncols = (housing.shape[1] - 1) // 3 (ocean_proximity is excluded), nrows = 3
for i, ax in enumerate(axes):
sns.boxplot(y = housing.iloc[:,i], ax = ax, whis=3)
ax.set_title(housing.iloc[:,i].name)
ax.set_ylabel("")
plt.suptitle("Boxplots")
plt.tight_layout()
plt.show()
Differences from the previous plot: the first figure shows only the numerical columns (numcols) together on one shared symlog axis, while this one draws a separate boxplot per feature, each on its own scale.
There are various options in scikit-learn for feature scaling, including:

- Standardization (preprocessing.StandardScaler)
- Min-max scaling (preprocessing.MinMaxScaler)
- l2 normalization (preprocessing.normalize)
- Robust scaling (preprocessing.RobustScaler)
- Scaling by the maximum absolute value (preprocessing.MaxAbsScaler)
- As scaling generally improves the performance of most models when features cover a range of scales, it is probably a good idea to apply some sort of scaling to our data before fitting a model.
- Standardization (or variance scaling), is the most common, but there are a number of other types, as listed above.
🚩 Exercise 14 (CORE)¶
Try implementing at least two different scalers for the total_rooms and total_bedrooms variables. Make a scatter plot of the original and transformed features to see the main differences.
# Code for your answer here!
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, MinMaxScaler
cols = ["total_rooms", "total_bedrooms"]
X_raw = X[cols].copy()
X_std = StandardScaler().fit_transform(X_raw)
X_mm = MinMaxScaler().fit_transform(X_raw)
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].scatter(X_raw[cols[0]], X_raw[cols[1]], s=8, alpha=0.3)
axes[0].set_title("Original")
axes[0].set_xlabel(cols[0])
axes[0].set_ylabel(cols[1])
axes[1].scatter(X_std[:, 0], X_std[:, 1], s=8, alpha=0.3)
axes[1].set_title("StandardScaler")
axes[1].set_xlabel(cols[0])
axes[1].set_ylabel(cols[1])
axes[2].scatter(X_mm[:, 0], X_mm[:, 1], s=8, alpha=0.3)
axes[2].set_title("MinMaxScaler")
axes[2].set_xlabel(cols[0])
axes[2].set_ylabel(cols[1])
plt.tight_layout()
plt.show()
# StandardScaler: each column has mean 0 and variance 1
# MinMaxScaler: each column is rescaled to the range [0, 1]
Power Transformation¶
In some cases, we may wish to apply transformations to our data, so that they have a more Gaussian distribution. For example, log transformations are useful for altering count data to have a more normal distribution as they pull in the more extreme high values relative to the median, while stretching back extreme low values away from the median. You can use a log transformation with either the pre-made LogTransformer() from feature_engine.transformation, or a custom function and sklearn.preprocessing.FunctionTransformer.
More generally, the natural logarithm, square root, and inverse transformations are special cases of the Box-Cox family of transformations (Box and Cox 1964). The question is why do we need such a transformation and when?
Note that, the method is typically used to transform the outcome, but can also be used to transform predictors.
The method assumes that the variable takes only positive values. If there are any zero or negative values, we can 1) shift the distribution towards positive values by adding a constant, or 2) use the Yeo-Johnson transformation (Yeo and Johnson 2000).
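As a small sketch of this point (with made-up values), Box-Cox fails on non-positive data while Yeo-Johnson handles it:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Hypothetical data containing a zero and a negative value
vals = np.array([[-2.0], [0.0], [1.5], [3.0], [10.0]])

# Box-Cox requires strictly positive data, so fitting raises an error here
try:
    PowerTransformer(method="box-cox").fit(vals)
except ValueError as err:
    print("box-cox failed:", err)

# Yeo-Johnson is defined for all real values
vals_yj = PowerTransformer(method="yeo-johnson").fit_transform(vals)
print(np.isfinite(vals_yj).all())  # True
```

Alternatively, shifting the data by a constant so all values are positive would let Box-Cox be applied, at the cost of an extra arbitrary parameter.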
In general, transformations can make interpretation more difficult, so you should think carefully about whether they are needed, particularly if they only yield modest improvements in model performance. Moreover, finding a suitable transformation is typically a trial-and-error process.
If you transform the features, you should also consider how this alters their relationship with the target variable.
The Yeo-Johnson transformation is defined as:
$\tilde{y} = \left\{ \begin{array}{l l} \frac{(y+1)^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \text{ and } y \geq 0 \\ \log (y +1), & \lambda = 0 \text{ and } y \geq 0 \\ -\frac{(1-y)^{2-\lambda} - 1}{2-\lambda}, & \lambda \neq 2 \text{ and } y < 0 \\ -\log (1-y), & \lambda = 2 \text{ and } y < 0\end{array} \right.,$
with the Box-Cox transformation as a special case (applied to $y-1$).
Because the parameter of interest is in the exponent, this type of transformation is called a power transformation and is implemented in sklearn's PowerTransformer. The parameter $\lambda$ is estimated from the data, and some values of $\lambda$ relate to common transformations, such as (for $y \geq 0$):
- $\lambda = 1$ (no transformation)
- $\lambda = 0$ (log)
- $\lambda = 0.5$ (square root)
- $\lambda = -1$ (inverse)
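We can check these special cases numerically with scipy's boxcox at a fixed lmbda. Note that the $\lambda = 1$, $\lambda = 0.5$, and $\lambda = -1$ cases match the identity, square root, and inverse only up to an affine shift and scale:

```python
import numpy as np
from scipy import stats

y = np.array([1.0, 2.0, 4.0, 8.0])

# lambda = 0 reduces to the natural log
assert np.allclose(stats.boxcox(y, lmbda=0), np.log(y))
# lambda = 0.5 equals 2 * (sqrt(y) - 1)
assert np.allclose(stats.boxcox(y, lmbda=0.5), 2 * (np.sqrt(y) - 1))
# lambda = 1 is just a shift: y - 1
assert np.allclose(stats.boxcox(y, lmbda=1), y - 1)
# lambda = -1 equals 1 - 1/y
assert np.allclose(stats.boxcox(y, lmbda=-1), 1 - 1 / y)
```

The affine shift and scale do not matter for most downstream models, which is why these are described as the same "transformation".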
Using the code below, if lmbda=None then the function will "find the lambda that maximizes the log-likelihood function and return it as the second output argument".

Notice that the argument cannot be named lambda directly, since lambda is a reserved keyword in Python; this is why the parameter is called lmbda instead.
fig, axes = plt.subplots(figsize = (15,5), ncols = 4, nrows=2, sharey = True)
axes = axes.flatten()
sns.histplot(data = X['households'], ax = axes[0])
axes[0].set_title("Raw Counts")
for i, lmbda in enumerate([0, 0.25, 0.5, 0.75, 1., 1.25, 1.5]):
house_box_ = stats.boxcox(X['households'].astype(float), lmbda = lmbda)
sns.histplot(data = house_box_, ax = axes[i + 1])
axes[i + 1].set_title(r"$\lambda$ = {}".format(lmbda))
plt.tight_layout()
plt.show()
fig, axes = plt.subplots(figsize = (15,5), ncols = 4, nrows=2, sharey = True)
axes = axes.flatten()
sns.scatterplot(x = X['households'], y = y, ax = axes[0])
axes[0].set_title("Raw Counts")
for i, lmbda in enumerate([0, 0.25, 0.5, 0.75, 1., 1.25, 1.5]):
house_box_ = stats.boxcox(X['households'].astype(float), lmbda = lmbda)
sns.scatterplot(x = house_box_, y = y, ax = axes[i + 1])
axes[i + 1].set_title(r"$\lambda$ = {}".format(lmbda))
plt.tight_layout()
plt.show()
We can find the $\lambda$ that maximizes the log-likelihood function using scipy's boxcox function or sklearn's PowerTransformer.
# Find the MLE for lambda (using scipy's boxcox function)
house_box_, bc_params = stats.boxcox(X['households'].astype(float), lmbda = None)
print(round(bc_params, 2))
# Find the MLE for lambda (using sklearn's PowerTransformer)
from sklearn.preprocessing import PowerTransformer
power_transformer = PowerTransformer(method='box-cox', standardize=False)
X_boxcox = power_transformer.fit_transform(X[['households']])
print(round(power_transformer.lambdas_[0], 2))
0.24
0.24
🚩 Exercise 15 (EXTRA)¶
- For the variable households, based on the boxcox transform shown above, do you think any of the values of $\lambda$ may be useful?
Answer by Sone: I think $\lambda = 0.25$ is the best choice: the histogram with $\lambda = 0.25$ is approximately bell-shaped and the most symmetric among all the options. In the scatterplot, the $\lambda = 0.25$ values are spread evenly across the x-axis, no longer squeezed to the left as in the raw data.
- Apply a similar code snippet to median_house_value. Would any values of $\lambda$ be useful?
Type your answer here!
# Code for your answer here!
fig, axes = plt.subplots(figsize = (15, 5), ncols = 4, nrows = 2) # no sharey here, since the scales differ a lot after transformation
axes = axes.flatten()
sns.histplot(y, ax = axes[0])
axes[0].set_title("Raw Counts")
# Try different values of lambda
lambdas = [0, 0.25, 0.5, 0.75, 1., 1.25, 1.5]
for i, lmbda in enumerate(lambdas):
# Box-Cox
y_transformed = stats.boxcox(y, lmbda = lmbda)
# histplot
sns.histplot(y_transformed, ax = axes[i + 1])
axes[i + 1].set_title(r"$\lambda$ = {}".format(lmbda))
plt.tight_layout()
plt.show()
Feature Combinations¶
Looking at the data's attributes, we may also want to manually combine them into features that are either a more meaningful representation of the data or have better properties.
For example, we know the number of rooms in a district, but this may be more useful to combine with the number of households so that we have a measure of rooms per household.
rooms_per_household = X['total_rooms'] / X['households']
rooms_per_household.describe()
count    20640.000000
mean         5.429000
std          2.474173
min          0.846154
25%          4.440716
50%          5.229129
75%          6.052381
max        141.909091
dtype: float64
🚩 Exercise 16 (EXTRA)¶
- Can you think of other combinations that may be useful?
Answer by Sone: bedrooms_per_room; population_per_household
- Create a custom transformer that creates these new combinations of features using the
FunctionTransformer.
Hint: What about the following?
- population_per_household
- bedrooms_per_room
# Code for your answer here!
from sklearn.preprocessing import FunctionTransformer
def add_extra_features(X):
X_new = X.copy()
# rooms_per_household
X_new['rooms_per_household'] = X_new['total_rooms'] / X_new['households']
# bedrooms_per_room
X_new['bedrooms_per_room'] = X_new['total_bedrooms'] / X_new['total_rooms']
# population_per_household
X_new['population_per_household'] = X_new['population'] / X_new['households']
return X_new
# 2. FunctionTransformer
# validate=False lets us pass in a pandas DataFrame instead of forcing conversion to a NumPy array
custom_transformer = FunctionTransformer(add_extra_features, validate=False)
X_with_extra_features = custom_transformer.fit_transform(X)
X_with_extra_features[['rooms_per_household', 'bedrooms_per_room', 'population_per_household']].describe()
| | rooms_per_household | bedrooms_per_room | population_per_household |
|---|---|---|---|
| count | 20640.000000 | 20433.000000 | 20640.000000 |
| mean | 5.429000 | 0.213039 | 3.070655 |
| std | 2.474173 | 0.057983 | 10.386050 |
| min | 0.846154 | 0.100000 | 0.692308 |
| 25% | 4.440716 | 0.175427 | 2.429741 |
| 50% | 5.229129 | 0.203162 | 2.818116 |
| 75% | 6.052381 | 0.239821 | 3.282261 |
| max | 141.909091 | 1.000000 | 1243.333333 |
Other feature types¶
Feature engineering for other feature types beyond numerical and categorical is also available in sklearn (e.g. for text and images) and feature_engine (e.g. for datetimes and time series).
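For example, simple datetime features can also be derived with plain pandas (a small sketch with made-up dates; feature_engine's datetime transformers automate this kind of extraction):

```python
import pandas as pd

# Hypothetical dates for illustration
ts = pd.Series(pd.to_datetime(["2024-01-15", "2024-06-30"]))

feat = pd.DataFrame({
    "month": ts.dt.month,          # 1-12
    "dayofweek": ts.dt.dayofweek,  # 0 = Monday, 6 = Sunday
})
print(feat["month"].tolist())      # [1, 6]
```

Such derived columns can then be fed into the same scaling or encoding steps as any other numerical or categorical feature.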
🏁 Now is a good point to switch driver and navigator
Combining into a Pipeline¶
Now that we are familiar with transformers, we are finally ready to create our first model pipeline!
Pipelines are very useful when we want to run new data through the same steps in the future; rather than copying and pasting a load of code, we can just reuse the pipeline, which combines all the steps. Later in the course, we will see this is important when we split our data into training, validation, and test sets, but it would also be required if you deploy your model in a "live" environment. In particular, pipelines help prevent data leakage, i.e. when information from your test data leaks into your training or model selection. Data leakage is a common reason why many ML models fail to generalize to real-world data. Furthermore, when refining a model, pipelines make it easier to add or remove steps to see what works and what doesn't.
It's also worth examining what is meant by a "Pipeline". A general definition is that it is a sequence of data preparation operations that is ensured to be reproducible. Specifically, in sklearn, a Pipeline can contain a sequence of transformer or estimator classes, or, if we use an imbalanced-learn Pipeline instead, also resamplers. This week we have focused on transformers, but later in the course we will learn about estimators and resamplers. All three of these objects (resamplers, transformers, and estimators) typically have a .fit() method. We have already seen examples of calling .fit() on transformers. The method works similarly on the other classes and is used to
- validate and interpret any parameters,
- validate the input data,
- estimate and store attributes from the parameters and provided data,
- return the fitted estimator to facilitate method chaining in a pipeline.
Along with other sample properties (e.g. sample_weight), the .fit() method usually takes two inputs:
The input matrix (or design matrix) $\mathbf{X}$. The size of $\mathbf{X}$ is typically (n_samples, n_features), which means that samples are represented as rows and features are represented as columns.
The target values $\mathbf{y}$ which are real numbers for regression tasks, or integers for classification (or any other discrete set of values). For unsupervised learning tasks, $\mathbf{y}$ does not need to be specified.
Other methods available for these objects other than .fit() will depend on what they are, e.g. .transform() for transformers, so we will learn about the methods for others objects later in the course.
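A minimal sketch of these steps, using a StandardScaler as the transformer (the data here is made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_demo = np.array([[1.0], [2.0], [3.0]])  # shape (n_samples, n_features)

scaler = StandardScaler()
# .fit() validates the input, stores the estimated attributes (mean_, scale_),
# and returns the fitted estimator itself, which is what enables method chaining:
X_scaled = scaler.fit(X_demo).transform(X_demo)

print(scaler.mean_)  # [2.]
```

After fitting, X_scaled has mean approximately 0 and standard deviation 1, and the same fitted scaler can be reused on new data via .transform().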
This week, our focus is combining different feature engineering steps together to make different model pipelines.
- Remember we want to create a pipeline that treats the numerical and categorical attributes differently.
- We also need to supply the pipeline with an estimator (i.e. model). For now, let's use a linear regression model, which we will learn about in more detail in week 4.
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
numcols = features[:-1]
catcols = [features[-1]]
# impute missing values with the median
num_pre = Pipeline([
("num_impute", SimpleImputer(strategy="median")),
("num_scale", StandardScaler())])
# one-hot encoder
cat_pre = Pipeline([
("cat_encode", OneHotEncoder(drop='first'))])
reg_pipe_1 = Pipeline([
("pre_processing", ColumnTransformer([("num_pre", num_pre, numcols),
("cat_pre", cat_pre, catcols)],
verbose_feature_names_out=False)),
("model", LinearRegression())
])
# Alternative and equivalent model avoiding nested pipelines
# reg_pipe_1 = Pipeline([
# ("impute", ColumnTransformer([("num_imp", SimpleImputer(strategy="median"), numcols),
# ("cat_imp", SimpleImputer(strategy="constant"), catcols)])),
# ("transform", ColumnTransformer([("num_trns", StandardScaler(), numcols),
# ("cat_trns", OneHotEncoder(drop='first'), catcols)])),
# ("model", LinearRegression())
# ])
display(reg_pipe_1)
Pipeline(steps=[('pre_processing',
ColumnTransformer(transformers=[('num_pre',
Pipeline(steps=[('num_impute',
SimpleImputer(strategy='median')),
('num_scale',
StandardScaler())]),
['longitude', 'latitude',
'housing_median_age',
'total_rooms',
'total_bedrooms',
'population', 'households',
'median_income']),
('cat_pre',
Pipeline(steps=[('cat_encode',
OneHotEncoder(drop='first'))]),
['ocean_proximity'])],
verbose_feature_names_out=False)),
('model', LinearRegression())])
reg_pipe_1.fit(X,y)
# Print the R squared (ranges 0 to 1, with higher values better)
print(round(reg_pipe_1.score(X, y), 3))
0.645
# Print the coefficients
coef_df = pd.DataFrame({'coef': reg_pipe_1['model'].coef_},
index = reg_pipe_1['pre_processing'].get_feature_names_out())
display(coef_df)
| | coef |
|---|---|
| longitude | -52952.951528 |
| latitude | -53767.624856 |
| housing_median_age | 13312.883346 |
| total_rooms | -10320.060926 |
| total_bedrooms | 29920.765076 |
| population | -44490.477443 |
| households | 29746.222267 |
| median_income | 73636.155864 |
| ocean_proximity_INLAND | -39766.398744 |
| ocean_proximity_ISLAND | 156065.719822 |
| ocean_proximity_NEAR BAY | -3697.401661 |
| ocean_proximity_NEAR OCEAN | 4758.753612 |
Let's try some other combinations of the pre-processing and feature engineering steps that we have learned about this week.
# Reg Pipe 2
# Define column names
numcols = ['longitude', 'latitude', 'housing_median_age', 'median_income']
countcols = ['total_rooms', 'total_bedrooms', 'population', 'households']
num_pre = Pipeline([
("num_scale", StandardScaler())])
count_pre = Pipeline([
("count_impute", SimpleImputer(strategy="median")),
("count_transform", PowerTransformer(method='box-cox', standardize=True))])
cat_pre = Pipeline([
("cat_encode", OneHotEncoder(drop='first'))])
# Overall ML pipeline including all steps
reg_pipe_2 = Pipeline([
("pre_processing", ColumnTransformer([
("num_pre", num_pre, numcols),
("count_pre", count_pre, countcols),
("cat_pre", cat_pre, catcols)], verbose_feature_names_out=False)),
("model", LinearRegression())
])
display(reg_pipe_2)
Pipeline(steps=[('pre_processing',
ColumnTransformer(transformers=[('num_pre',
Pipeline(steps=[('num_scale',
StandardScaler())]),
['longitude', 'latitude',
'housing_median_age',
'median_income']),
('count_pre',
Pipeline(steps=[('count_impute',
SimpleImputer(strategy='median')),
('count_transform',
PowerTransformer(method='box-cox'))]),
['total_rooms',
'total_bedrooms',
'population',
'households']),
('cat_pre',
Pipeline(steps=[('cat_encode',
OneHotEncoder(drop='first'))]),
['ocean_proximity'])],
verbose_feature_names_out=False)),
('model', LinearRegression())])
reg_pipe_2.fit(X,y)
# Print the R squared (ranges 0 to 1, with higher values better)
print(round(reg_pipe_2.score(X, y), 3))
0.668
# Print the coefficients
coef_df = pd.DataFrame({'coef': reg_pipe_2['model'].coef_},
index = reg_pipe_2['pre_processing'].get_feature_names_out())
display(coef_df)
| | coef |
|---|---|
| longitude | -56830.217843 |
| latitude | -59129.677108 |
| housing_median_age | 12933.426923 |
| median_income | 76934.983282 |
| total_rooms | -22094.529289 |
| total_bedrooms | 45298.070771 |
| population | -65930.690370 |
| households | 46349.289219 |
| ocean_proximity_INLAND | -35349.967357 |
| ocean_proximity_ISLAND | 134076.722055 |
| ocean_proximity_NEAR BAY | -7618.609049 |
| ocean_proximity_NEAR OCEAN | -702.953492 |
# Reg Pipe 3
from feature_engine.transformation import LogTransformer
from sklearn.compose import TransformedTargetRegressor
numcols = ['longitude', 'latitude', 'housing_median_age']
skewcols = ['total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']
num_pre = Pipeline([
("num_scale", StandardScaler())])
skew_pre = Pipeline([
("skew_impute", SimpleImputer(strategy="median")),
("skew_transform", LogTransformer()),
("skew_scale", StandardScaler())])
cat_pre = Pipeline([
("cat_encode", OneHotEncoder(drop='first'))])
# Overall ML pipeline including all steps
reg_pipe_3 = Pipeline([
("pre_processing", ColumnTransformer([
("num_pre", num_pre, numcols),
("skew_pre", skew_pre, skewcols),
("cat_pre", cat_pre, catcols)], verbose_feature_names_out=False)),
("model", LinearRegression())
])
# Transform also the target variable
tt_reg_pipe_3 =TransformedTargetRegressor(regressor=reg_pipe_3,
transformer=LogTransformer())
display(tt_reg_pipe_3)
TransformedTargetRegressor(regressor=Pipeline(steps=[('pre_processing',
ColumnTransformer(transformers=[('num_pre',
Pipeline(steps=[('num_scale',
StandardScaler())]),
['longitude',
'latitude',
'housing_median_age']),
('skew_pre',
Pipeline(steps=[('skew_impute',
SimpleImputer(strategy='median')),
('skew_transform',
LogTransformer()),
('skew_scale',
StandardScaler())]),
['total_rooms',
'total_bedrooms',
'population',
'households',
'median_income']),
('cat_pre',
Pipeline(steps=[('cat_encode',
OneHotEncoder(drop='first'))]),
['ocean_proximity'])],
verbose_feature_names_out=False)),
('model',
LinearRegression())]),
transformer=LogTransformer())
tt_reg_pipe_3.fit(X,y)
# Print the R squared (ranges 0 to 1, with higher values better)
print(round(tt_reg_pipe_3.score(X, y), 3))
0.666
# Print the coefficients
# Note: get_feature_names_out() does not work for LogTransformer
reg3_features = np.concatenate([tt_reg_pipe_3.regressor_['pre_processing']['num_pre'].get_feature_names_out(),
tt_reg_pipe_3.regressor_['pre_processing']['skew_pre'].feature_names_in_,
tt_reg_pipe_3.regressor_['pre_processing']['cat_pre'].get_feature_names_out()]
)
coef_df = pd.DataFrame({'coef': tt_reg_pipe_3.regressor_['model'].coef_},
index = reg3_features)
display(coef_df)
| | coef |
|---|---|
| longitude | -0.325886 |
| latitude | -0.346085 |
| housing_median_age | 0.032441 |
| total_rooms | -0.095483 |
| total_bedrooms | 0.209126 |
| population | -0.294461 |
| households | 0.188207 |
| median_income | 0.342995 |
| ocean_proximity_INLAND | -0.280706 |
| ocean_proximity_ISLAND | 0.465280 |
| ocean_proximity_NEAR BAY | -0.044564 |
| ocean_proximity_NEAR OCEAN | -0.046772 |
🚩 Exercise 17 (CORE)¶
Explain in words what are the differences in pre-processing and/or feature engineering steps used across the three model pipelines above.
Answer by Sone: There are three model pipelines above.

All three impute missing values with the median and one-hot encode ocean_proximity. Pipeline 1 treats all numerical features the same way (median imputation then standardization). Pipeline 2 separates out the count-like columns (total_rooms, total_bedrooms, population, households) and applies a Box-Cox power transformation to them instead of plain standardization. Pipeline 3 instead log-transforms the skewed columns (the counts plus median_income) and, via TransformedTargetRegressor, also log-transforms the target variable.
🚩 Exercise 18 (EXTRA)¶
Try to create your own pipeline by modifying at least one of the pre-processing and feature engineering steps above. What have you decided to change and why?
# Exercise 18
from sklearn.preprocessing import PowerTransformer
numcols = ['longitude', 'latitude', 'housing_median_age']
skewcols = ['total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']
catcols = ['ocean_proximity']
num_pre = Pipeline([
("num_scale", StandardScaler())])
# Change: use yeo-johnson instead of LogTransformer (it can handle zero and negative values)
skew_pre = Pipeline([
("skew_impute", SimpleImputer(strategy="median")),
("skew_transform", PowerTransformer(method='yeo-johnson', standardize=True))])
cat_pre = Pipeline([
("cat_encode", OneHotEncoder(drop='first'))])
reg_pipe_4 = Pipeline([
("pre_processing", ColumnTransformer([
("num_pre", num_pre, numcols),
("skew_pre", skew_pre, skewcols),
("cat_pre", cat_pre, catcols)], verbose_feature_names_out=False)),
("model", LinearRegression())
])
# Display the pipeline
display(reg_pipe_4)
# Fit and print the R² score
reg_pipe_4.fit(X, y)
print(f"R² score: {round(reg_pipe_4.score(X, y), 3)}")
Pipeline(steps=[('pre_processing',
ColumnTransformer(transformers=[('num_pre',
Pipeline(steps=[('num_scale',
StandardScaler())]),
['longitude', 'latitude',
'housing_median_age']),
('skew_pre',
Pipeline(steps=[('skew_impute',
SimpleImputer(strategy='median')),
('skew_transform',
PowerTransformer())]),
['total_rooms',
'total_bedrooms',
'population', 'households',
'median_income']),
('cat_pre',
Pipeline(steps=[('cat_encode',
OneHotEncoder(drop='first'))]),
['ocean_proximity'])],
verbose_feature_names_out=False)),
                ('model', LinearRegression())])
R² score: 0.623
Advantages of Yeo-Johnson by Song:
- Handles zero and negative values, whereas log requires strictly positive data.
- Automatically learns the optimal λ, adapting to the data distribution.
- More flexible: it generalizes to different skewness patterns.
- Built-in standardization: standardize=True combines the transform and scaling steps.
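The first point can be demonstrated directly. In this small sketch, the data includes a zero, which `np.log` would map to `-inf`, while `PowerTransformer` fits a λ and returns finite, standardized values:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Skewed data including a zero: a plain log transform would produce -inf here
X = np.array([[0.0], [1.0], [10.0], [100.0], [1000.0]])

pt = PowerTransformer(method="yeo-johnson", standardize=True)
Xt = pt.fit_transform(X)

print("learned lambda:", pt.lambdas_[0])  # λ estimated by maximum likelihood
print(Xt.ravel())                          # finite, roughly Gaussian output
```

With `standardize=True` the output is also centered and scaled, so no separate `StandardScaler` step is needed for these columns.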
Summary ¶
This week we covered a lot of ground!
We've looked at some methods for pre-processing our data, cleaning and preparing it, as well as how to engineer some features and combine these steps into a reproducible pipeline.
This is by no means a complete collection of all the methods available, as covering more would go beyond the scope of this course (for those interested in learning more, have a look through the given companion readings).
For example, we barely touched on handling text and dates/times. These topics are complex enough to fill courses of their own.
Completing the Worksheet¶
At this point you have hopefully been able to complete all the CORE exercises and attempted the EXTRA ones. Now is a good time to check the reproducibility of this document by restarting the notebook's kernel and rerunning all cells in order.
Before generating the PDF, please change 'Student 1' and 'Student 2' at the top of the notebook to include your name(s).
Once that is done and you are happy with everything, you can then run the following cell to generate your PDF.
!jupyter nbconvert --to pdf mlp_week01.ipynb
[NbConvertApp] Converting notebook mlp_week01.ipynb to pdf
[NbConvertApp] Support files will be in mlp_week01_files\
[NbConvertApp] Making directory .\mlp_week01_files
[NbConvertApp] Writing 139624 bytes to notebook.tex
[NbConvertApp] Building PDF
Traceback (most recent call last):
File "F:\anaconda\envs\mlp\Scripts\jupyter-nbconvert-script.py", line 10, in <module>
sys.exit(main())
^^^^^^
File "F:\anaconda\envs\mlp\Lib\site-packages\jupyter_core\application.py", line 284, in launch_instance
super().launch_instance(argv=argv, **kwargs)
File "F:\anaconda\envs\mlp\Lib\site-packages\traitlets\config\application.py", line 1075, in launch_instance
app.start()
File "F:\anaconda\envs\mlp\Lib\site-packages\nbconvert\nbconvertapp.py", line 420, in start
self.convert_notebooks()
File "F:\anaconda\envs\mlp\Lib\site-packages\nbconvert\nbconvertapp.py", line 597, in convert_notebooks
self.convert_single_notebook(notebook_filename)
File "F:\anaconda\envs\mlp\Lib\site-packages\nbconvert\nbconvertapp.py", line 563, in convert_single_notebook
output, resources = self.export_single_notebook(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\anaconda\envs\mlp\Lib\site-packages\nbconvert\nbconvertapp.py", line 487, in export_single_notebook
output, resources = self.exporter.from_filename(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\anaconda\envs\mlp\Lib\site-packages\nbconvert\exporters\templateexporter.py", line 390, in from_filename
return super().from_filename(filename, resources, **kw) # type:ignore[return-value]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\anaconda\envs\mlp\Lib\site-packages\nbconvert\exporters\exporter.py", line 201, in from_filename
return self.from_file(f, resources=resources, **kw)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\anaconda\envs\mlp\Lib\site-packages\nbconvert\exporters\templateexporter.py", line 396, in from_file
return super().from_file(file_stream, resources, **kw) # type:ignore[return-value]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\anaconda\envs\mlp\Lib\site-packages\nbconvert\exporters\exporter.py", line 220, in from_file
return self.from_notebook_node(
^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\anaconda\envs\mlp\Lib\site-packages\nbconvert\exporters\pdf.py", line 197, in from_notebook_node
self.run_latex(tex_file)
File "F:\anaconda\envs\mlp\Lib\site-packages\nbconvert\exporters\pdf.py", line 166, in run_latex
return self.run_command(
^^^^^^^^^^^^^^^^^
File "F:\anaconda\envs\mlp\Lib\site-packages\nbconvert\exporters\pdf.py", line 120, in run_command
raise OSError(msg)
OSError: xelatex not found on PATH, if you have not installed xelatex you may need to do so. Find further instructions at https://nbconvert.readthedocs.io/en/latest/install.html#installing-tex.
!jupyter nbconvert --to html mlp_week01.ipynb
[NbConvertApp] Converting notebook mlp_week01.ipynb to html
[NbConvertApp] WARNING | Alternative text is missing on 7 image(s).
[NbConvertApp] Writing 4412947 bytes to mlp_week01.html
Once generated, please submit the PDF on the Learn page by 16:00 on the Friday of the week the workshop was given. Note that:
- You don't need to finish everything, but you should have made a substantial attempt at the bulk of the material, particularly the CORE tasks.
- If you are having trouble generating the PDF, please ask a tutor or post on Piazza.
- As a backup option, if the conversion to PDF fails, a quick solution is to export to HTML and then print to PDF from your browser.